Overview

This notebook explores some of the different options and considerations regarding the model structure. Generally, the model takes worker-level traits as data, and predicts the mean and standard deviation for each colony of each species, with environmental effects estimated for both of those quantities.

clny_min <- 3
rng_thresh <- 300
response_vars <- c("v", 
                   "HeadWidth", 
                   "HeadLength",
                   "HeadShape",
                   "DVE",
                   "WebersLength", 
                   "PronotExp",
                   "ScapeProp",
                   "RelLegHind")
X_vars_mn <- c(minTwarm="minTwarmest",
               Pwarm="PwarmQ",
               Pcold="PcoldQ",
               NPP="npp",
               North="aspectN",
               Day="SampleDate") 
X_vars_sd <- c(minTwarm="minTwarmest",
               Pwarm="PwarmQ",
               Pcold="PcoldQ",
               NPP="npp",
               TAR="TAR",
               Day="SampleDate") 
genera_incl <- c("Myrm", 
                 "Mani",
                 "Lasi",
                 "Temn",
                 "Tetr",
                 "Lept",
                 "Apha"
                 )

Standardization

Issue overview

Colony-level trait distributions are modelled as a function of the environment along the elevational gradient. It is desirable to standardize the response and covariate values for both practical reasons (better model behaviour) and ecological reasons (similar to analyzing coefficients of variation). A basic question is thus how to standardize.

For many traits, we might expect that the standard deviation scales with the mean. For example, larger species will have larger standard deviations in size.

Standardization options

There are three options for standardizing:

  1. Standarize across the full dataset for covariates and for traits
    • One unit change in covariate represents change across full covariate space
    • One unit change in trait represents change across full trait space
    • Slopes: effect of absolute covariate on absolute trait
    • Expect variation in scale of response among species, since some species occupy a larger proportion of the trait space
    • Species with small trait values are less likely to show detectable trends
  2. Standardize within each species for covariates and for traits
    • One unit change in covariate represents change within the species’ covariate space
    • One unit change in trait represents change within the species’ trait space
    • Slopes: effect of relative covariate on relative trait
    • Covariate space and trait space is translated to each species
  3. Standardize across the full dataset for covariates and within each species for traits
    • One unit change in covariate represents change across full covariate space
    • One unit change in trait represents change within the species’ trait space
    • Slopes: effect of absolute covariate on relative trait
    • Similar to analyzing effect of 1ºC change in temperature on trait CV

I think there are plausible arguments for all three.

Method 1 lets us investigate the effect of an absolute change in environment on an absolute change in traits. A 1ºC increase in temperature translates to a 5\(\mu m\) decrease in mean worker body size on average, across all species. Species’ slopes can fluctuate, and genus-level effects could be informative given the clustering of many traits by genus, particularly for the intercepts.

Method 2 assumes that species with narrower environmental spaces are more sensitive to changes in the environment. Within the range of temperatures each species experiences, a shift from the minimum to the 25th percentile equates to a shift from the maximum body size to the 50th percentile, on average across species. Everything is a shift within the relative space of a species.

Method 3 uses aspects of both, assuming that species experiencing a broader range of environmental conditions will show larger changes in traits. A 1ºC increase in temperature translates to a shift of 25% within the species’ trait space. If a species spans a smaller temperature range, we expect a smaller change in its traits, but the traits are still measured relative to the species.

To illustrate this graphically:

Comparison

With Methods 2 or 3, species-level slopes should be more identifiable since the trait range is standardized to mean=0 and sd=1. With a hierarchical structuring of the slopes, the overall effect \(\beta\) of the environmental variable is determined by the species-level slopes \(b\), such that \(b_s \sim Normal(\beta, \sigma_b)\). These could be clustered taxonomically or phylogenetically, or without any structure. Without structure, \(\beta\) is effectively the mean of the slopes of the blue lines above. With structure, it’s the taxonomically- or phylogenetically-weighted mean.

Similarly, Method 2 should lead to even greater identifiability since both the covariate-space and the trait-space are standardized within each species, but the interpretation is a bit strange, since it relies essentially on traits being more sensitive to the environment for species with narrower environmental ranges. The overall effect \(\beta\) would represent the average change in a species’ trait-space given a step in the species’ environment-space. Alternatively, species with narrow environmental ranges may simply show no relationship to the environment. This should also be the case with Method 3. If I were looking at 1-3 species separately rather than in an aggregated framework, Method 2 is what I would do.

We do indeed see that species with larger environmental ranges tend to have more variable workers and more variable colonies, whether we use the absolute standard deviation or the CV, using elevational range as a proxy:

Conclusion

The slopes can always be back-transformed if needed, though it is a bit complicated.

However, I think Method 2 or Method 3 would be better than Method 1 for the practical purposes of the slope identifiability. Therefore, going forward, the response variable will be the trait value standardized within each species.

It is less clear between Method 2 or Method 3, and I’ve gone back and forth a lot.


Genus-level effects

Issue overview

Obviously, species are clustered within genera, and species within a genus are typically morphologically quite similar to each other. A hierarchical framework for the slopes can naturally and easily incorporate taxonomic or phylogenetic structure to the responses. Conversely, while the trait values themselves are similar among congeners, it isn’t clear to what extent the patterns of intraspecific variation across environments are similar among congeners. That is, with Method 1, there would likely be genus-level effects on at least the species-level intercepts for the trait means. Similarly, it is not clear how much the patterns of intra-colony variance would cluster by genus.

As it is, I have versions with and without genus-level effects. One argument in favor of including them is that the taxonomic coverage is not complete. Rather, several Myrmicine genera are fully measured, while only a few Lasius species are included, and a single Formica species. The estimates of \(\beta\) will be more heavily weighted toward the responses of Myrmica, Lasius, Temnothorax, and Tetramorium if there is no hierarchical structure.

It may be possible to instead include a species-level covariance matrix based on the phylogeny, if that’s something they plan to do with the genetic data. Alternatively, I could include a genus-level phylogeny with the genus-level hierarchical slopes.

Conclusion

Either keep genus-level effects, or use a species-level phylogeny.


Colony-to-Predicted \(\sigma\)

Issue overview

This model has many levels. Workers \(i\) from colony \(j\) of species \(s\) are drawn from colony-level trait distributions:

\[\begin{equation} y_{i,j,s} \sim Norm(\bar{y}_{j,s}, d_{j,s}) \end{equation}\]

with intra-colony means \(\bar{y}_{j,s,}\) distributed about regression lines based on environmental variables:

\[\begin{equation} \bar{y}_{j,s} \sim Norm(\mu_{j,s}, \sigma_{1,s}) \end{equation}\]

\[\begin{equation} \mu_{j,s} = X_j b_s \end{equation}\]

and intra-colony standard deviations \(d_{j,s}\) distributed similarly:

\[\begin{equation} log(d_{j,s}) \sim Norm(\delta_{j,s}, \sigma_{2,s}) \end{equation}\]

\[\begin{equation} \delta_{j,s} = V_j a_s \end{equation}\]

There are therefore multiple error terms, each of which could be treated as species-specific. I had found convergence issues and thought I figured it out, but now I’m not certain what I was doing. The full model as described above fails to converge. I think my solution was to adjust \(\sigma_{x,s}\) to be global rather than species-specific, but I’m not sure. I’m re-running models now to investigate.

As it is, convergence looks good for 5_aI_bIS models in the global_sigma/, global_sigma/no_gen/, and fixed_delta/ directories. These models all use a global \(\sigma\) for the standard deviation of \(\bar{y}\) and \(log(d)\) rather than allowing species-specific values.

Simplifying assumptions

A global \(\sigma\) for the deviation of the latent colony mean or sd to the prediction from the regression line would mean that this error term is jointly estimated across all species. This assumes that the variation among colonies in identical environments is the same across all species. This is not true, but it may be an unavoidable simplification. Standardizing the traits within each species should help reduce the effect of this mis-specification also.

Under this scenario, the equations above become:

\[\begin{equation} \bar{y}_{j,s} \sim Norm(\mu_{j,s}, \sigma_{1}) \end{equation}\] \[\begin{equation} log(d_{j,s}) \sim Norm(\delta_{j,s}, \sigma_{2}) \end{equation}\]

Alternatively, the fixed_delta/ directory forgoes the prediction of \(log(d_{j,s})\) based on the environment altogether, and instead draws \(d_{j,s}\) from a lognormal distribution with either a global \(\sigma\) or, in fixed_delta/*_aI_*.stan, species-specific \(\sigma_s\). Both seem to converge, but it eliminates the prediction of \(d_{j,s}\) based on the environment.

Conclusion

This assumption will almost certainly be necessary. Either \(\delta\) is predicted by the environment with a global \(\sigma_2\), or \(d\) is drawn directly from species-specific \(\sigma_{2,s}\) with no environmental regression, with variation in colony-level \(d\) modelled as noise.


Robust worker distribution

Issue overview

Because of logistical constraints, I have relatively few workers per colony. It is not out of line with past work, but it is on the low side for estimating standard deviations within colonies. More specifically, I will tend to underestimate intra-colony variance. One option available is to estimate the colony-level distribution using robust regression, where workers follow a Student-t distribution. This introduces one additional parameter, \(\nu\), which of course controls the fatness of the tails. In theory, this reduces the effect of extreme values. For example, if one of the four workers in a colony is exceptionally light colored, that will have less influence on the predicted \(\bar{y}\) and \(d\).

Stan uses a parameterization that uses a location (\(\bar{y}\)), scale (\(d\)), and degrees of freedom (\(\nu\)), so it doesn’t interfere with either colony-level regression. For robust regression to work, \(\nu\) will almost certainly have to be global rather than species-specific, and it will also almost certainly require the same assumption of global deviation from the regression line to the colony-level parameters.

In fact, since there are three locations in the model that assume normal error, there are three places where robust regression could be implemented:

    1. The distribution of worker-level traits within colonies such that \(y_{i,j,s} \sim Student(\bar{y}_{j,s}, \nu_{s}, d_{j,s})\)
    1. The distribution of colony-level means about the predicted mean line such that \(\bar{y}_{j,s} \sim Student(\mu_{j,s}, \nu_{1,s} \sigma_{1,s})\)
    1. The distribution of colony-level standard deviation such that \(log(d_{j,s}) \sim Student(\delta_{j,s}, \nu_{2,s} \sigma_{2,s})\)

Data-based evidence

Each of these three decisions should be based on data. The decision for 2 and 3 are based off of residuals to the regression lines, and consequently the explanatory power of the covariates will influence the width of the error distribution, and the assumed error distribution could affect the explanatory power of the covariates. The distribution of workers within colonies is somewhat less dependent on the covariates, though it is indirectly through the estimation of the colony means. In practice, both \(\bar{y}\) and \(d\) are generally quite close to the values directly calculated from the observed worker traits.

So these are the questions to answer regarding robust regression:

  1. Are traits distributed normally within colonies?
  2. Are colony means distributed normally about mu?
  3. Are colony log(sd)’s distributed normally about delta?
  4. If the answer to any of these is plausibly no, then should df be constant or vary among species?

1. Are traits distributed normally within colonies?

There are only 4 workers per colony maximum, so this is somewhat challenging to answer from the data. I’ll need to look into the literature too to see if I can find anything about the species here. Obviously the assumption of normality would not hold for species that are dimorphic or polymorphic, but that’s one reason I’m not dealing with them.

Below, I compare worker traits from three sources: \(y_{obs}\), \(y_{Norm}\), and \(y_{Robust}\), where \(y_{Norm}\) and \(y_{Robust}\) are simulated based on fitted models. The goal is to assess whether the observed values are better described by a normal distribution or a Student-t distribution.

So it seems that generally there is very little difference, but that the normal distribution performs slightly better. In fact, the residuals of observed worker traits to observed colony means is somewhat narrower than a normal distribution, such that the normal distribution has higher density on the tails than the t-distribution.

In theory, \(\nu\) could vary across species, but as with the \(\sigma\) terms, that leads to a failure to converge.

2. Are colony means distributed normally about mu?

Next, we can look to see whether there is evidence that \(\bar{y}\) should be modelled with a robust regression about \(\mu\). This would allow for additional dispersion in trait means among colonies found in the same environmental conditions.

Again, we see very little difference, with slightly better performance with a normal distribution.

3. Are log(d)’s distributed normally about delta?

I think there’s a bug in the code below, since there are a few colonies in the Normal models with absurd log likelihoods…

Conclusion

It seems best to assume that workers within colonies are distributed normally.


Fixed vs. random effects

Issue overview

In each of the regressions, the slopes could be either global, or vary by species. I think it makes the most sense to allow the relationship to vary by species, with hierarchical slopes as described above in the section on genus-level effects. However, I do have versions of the model that allow for 1) fixed effects, 2) random intercepts, and 3) random intercepts and slopes, as well as each combination for the \(\delta\) and \(\mu\) regressions.